Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: repro: add --pull #1841

Merged
merged 2 commits into from
Oct 6, 2020
Merged

docs: repro: add --pull #1841

merged 2 commits into from
Oct 6, 2020

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Oct 5, 2020

Per iterative/dvc#4538

You may disregard these recommendations if you used the Edit on GitHub button from dvc.org to improve a doc in place.

❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.

🐛 Please make sure to mention Fix #issue (if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.

Please choose to allow us to edit your branch when creating the PR.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@@ -154,6 +155,9 @@ up-to-date and only execute the final stage.
corresponding pipelines, including the target stages themselves. This option
has no effect if `targets` are not provided.

- `--pull` - try automatically pulling cached outputs if they are not present in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, a few questions:

  1. try - what happens if it fails?
  2. pulling - make it a link to the dvc pull probably
  3. cached outputs - here not sure if it's better to use DVC-tracked outputs. (otherwise when you read it is bit hard mentally since you are they cached but not present in cache).

WDYT?

@jorgeorpinel ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. as it was before, it will simply not restore from run-cache.

Addressed 2 and 3. Thank you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, got it. I guess it's fine for now. No reason to further improve this since we don't have run-cache documented anywhere. So we can keep as is -an advanced option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have run-cache documented anywhere

BTW the run-cache is already mentioned in 6 cmd refs (published) and in the Data Pipelines page of the GS, which I just noticed/realized just now. I thought we were not going to include any info about experiments until it's more stable? Should we remove these mentions or prioritize documenting run-cache? Thanks

Copy link
Contributor Author

@efiop efiop Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Let's not remove it. It is in a semi-official state, people already use it due to cml and other sources. We are on our way to cleaning up the ui overall and publishing experiments.

Run-cache doc by itself doesn't really mean anything to the users, which is why I didn't write it in the summer. It only makes sense in particular commands, so the doc about run-cache internals could wait for the high-level commands.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Oct 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not remove remove it. It is in a semi-official state

OK, I agree it's best too keep, but it could be problematic that the run-cache mentions are completely out of context (no explanation of the concept anywhere).

Run-cache doc by itself doesn't really mean anything to the users... the doc about run-cache internals could wait...

Much disagree 🐶 I mean it's not so important whether it's a stand-alone doc or a new section in existing doc(s), but the basics about run-cache seem like a quite important thing to document to me.

It only makes sense in particular commands

Yeah anywhere we want to put it as long as it's published would be great since this is already semi-official.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Agreed, I've added the run-cache doc ticket to next sprint, just preliminarily. Thanks 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thank! 🦊

@shcheklein shcheklein temporarily deployed to dvc-landing-repro-pull-uuofdyr October 6, 2020 00:46 Inactive
@shcheklein shcheklein merged commit 15aa177 into master Oct 6, 2020
@efiop efiop deleted the repro_pull branch October 6, 2020 01:05
Comment on lines +158 to +159
- `--pull` - try automatically [pulling](/doc/command-reference/pull) missing
cache for outputs restored from run-cache.
Copy link
Contributor

@jorgeorpinel jorgeorpinel Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back on this. Per iterative/dvc#4538 (comment):

dvc repro --pull pulls regular files, hashes for which might've been restored from the existing run-cache, so kinda like regular dvc pull

Unfortunately I don't understand either one of the explanations. What's the relationship between run-cache and repro --pull? Maybe a step-by-step explanation like 1. Use repro --pull; 2. run-cache is checked before executing commands (default repro behavior I think); 3. Some output hashes are found? (but not the actual files? This is the confusing part); 4. Hashes are looked for in the cache but not found; 5. The files are looked for in remote storage. Something like that

Please @efiop ! Thanks in advance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel Even if we leave the run-cache out, repro --pull would still try to dvc pull outputs that are missing, but the pipeline didn't change. E.g. when you forgot to dvc pull beforehand and you are trying to dvc repro otherwise up-to-date pipeline, so dvc repro --pull will just pull the outputs for such stages instead of trying to reproduce them.

Run-cache is then just a special source of lock files, and repro --pull works the same way as explained above.

Want to point out again that --pull is still a temporary solution that was needed to improve pull --run-cache that is also not complete in a product sense. So I would recommend not spending much time on this, as the product scenario is WIP and there is no reason to optimize the docs for it too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK it makes more sense now, thanks.

In this case I do feel like need to spend enough time understanding what's going on so that when the coming bulk of docs related to new features hit, I'm better prepared. So thanks again for baring with me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last Q @efiop. Does this only check the default remote (if one is set)? Or all remotes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, 2 more questions...

  • Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?
  • What happens if you do repro --pull --no-run-cache? Is the run-cache check skipped?

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only check the default remote (if one is set)? Or all remotes?

Only the default remote right now.

Does it check only the local run-cache? Or also the remote run-cache for possible dep/out hashes?

Yes, only local run-cache.

What happens if you do repro --pull --no-run-cache? Is the run-cache check skipped?

Correct. It will only pull if you have your lock file complete (so hashes are already there, just the outputs are missing from cache), but won't try to use run-cache.

Please feel free to ask any questions, I do understand that this incomplete feature is a bit confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgeorpinel jorgeorpinel mentioned this pull request Oct 20, 2020
jorgeorpinel added a commit that referenced this pull request Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants